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Abstract 

The  Certified  Reduced  Basis  method  (herein  RB  method)  is  a  popular  approach  for  model  reduction  of 
parametrized  partial  differential  equations.  In  this  paper  we  introduce  new  techniques  that  are  required 
to  efficiently  implement  the  Offline  “Construction  stage”  of  the  RB  method  on  high-performance  parallel 
supercomputers.  This  enables  us  to  generate  certified  RB  models  for  large-scale  three-dimensional  problems 
that  can  be  evaluated  on  standard  workstations  and  other  “thin”  computing  resources  with  speedup  of  many 
orders  of  magnitude  compared  to  the  corresponding  full  order  model.  We  use  our  implementation  to  perform 
detailed  numerical  studies  for  two  computationally  expensive  model  problems:  a  natural  convection  fluid 
flow  problem  and  a  “many  parameter”  heat  transfer  problem.  In  the  heat  transfer  problem,  we  exploit  the 
computational  efficiency  of  the  RB  method  to  perform  a  detailed  study  of  “snapshot”  selection  in  the  Greedy 
algorithm,  and  we  also  examine  statistics  of  the  output  sensitivity  derivatives  to  obtain  a  “global”  view  of 
the  relative  importance  of  the  parameters. 

Keywords:  Reduced  basis  method,  a  posteriori  error  bounds,  finite  element  methods,  high-performance 
computing,  Boussinesq  equation,  heat  transfer,  sensitivity  derivatives 


1.  Introduction 

High-fidelity  computational  simulation  of  parametrized  partial  differential  equations  (PDEs)  is  funda¬ 
mentally  important  in  many  scientific  and  engineering  contexts.  However,  classical  high-fidelity  numerical 
discretizations  —  finite  element  methods,  finite  difference  methods,  spectral  methods  —  are  computation¬ 
ally  expensive  for  large-scale  problems.  This  limits  their  applicability  in  situations  where  a  PDE  needs  to 
be  solved  for  many  different  parameters  (optimization,  inverse  problems,  multiscale  analysis),  or  in  which 
real-time  evaluation  is  desirable  (control  applications,  “in  situ”  design).  To  address  this  limitation,  a  large 
body  of  model  reduction  literature  has  emerged  [1,  2,  3,  4].  The  basic  goal  in  model  reduction  is  to  de¬ 
velop  a  low-dimensional  representation  that  accurately  captures  the  dynamics  of  a  high-fidelity  numerical 
discretization  over  a  specific  parameter  regime  of  interest.  One  well-established  approach  to  model  reduc¬ 
tion  of  parametrized  PDEs  is  the  certified  Reduced  Basis  (RB)  method  [5,  6,  7,  8,  9].  In  the  RB  method, 
a  very  low-dimensional  approximation  to  the  “solution  manifold”  of  a  parametrized  PDE  is  constructed, 
typically  resulting  in  a  reduced  order  model  that  achieves  orders  of  magnitude  savings  in  computation  time 
and  memory. 

Throughout  this  paper  we  shall  assume  there  exists  a  sufficiently  accurate  (for  our  purposes)  finite  element 
discretization  of  a  parametrized  PDE,  and  that  we  seek  to  construct  a  reduced-order  model  based  on  this 
discretization  via  the  RB  method.  Employing  the  standard  RB  nomenclature,  we  shall  henceforth  refer  to 
the  high-fidelity  discretization  as  the  “truth”  model.  It  has  been  well-established  [9]  that,  for  a  class  of  PDEs 
that  satisfy  the  “affine  hypothesis,”  the  RB  method  allows  rapid  and  highly-accurate  approximation  of  the 
high-fidelity  (truth)  model  and,  moreover,  incorporates  inexpensive  a  posteriori  error  bounds  that  provide  a 
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rigorous  measure  of  the  error  incurred  in  the  model  reduction  process.  The  focus  of  this  paper,  however,  is 
to  demonstrate  that  by  exploiting  opportunities  for  parallelism  in  the  RB  framework,  we  can  realize  the  full 
potential  of  the  RB  methodology  for  large-scale  PDE  simulations.  The  “pay-off”  of  model  reduction  clearly 
increases  in  direct  proportion  to  the  complexity  and  expense  of  the  full-fidelity  simulation. 

The  RB  framework  is  naturally  separated  into  an  “inexpensive”  Online  stage  and  an  “expensive”  Offline 
stage.  In  the  inexpensive  Online  stage,  the  reduced-order  model  and  corresponding  error  bounds  can  be 
evaluated  for  user-specified  input  parameters  —  the  cost  of  the  Online  calculations  are  independent  of  the 
truth  discretization  and  hence  can  be  performed  very  rapidly  and  with  limited  computational  resources  [10]. 
The  Offline  stage,  on  the  other  hand,  corresponds  to  the  construction  of  the  reduced  order  model.  In 
this  stage  we  appeal  to  the  truth  discretization  to  develop  basis  functions  that  are  highly  tailored  to  the 
solution  manifold,  as  well  as  to  perform  other  truth  model-dependent  preprocessing  for  subsequent  Online 
evaluation  steps.  In  this  paper  we  develop  a  high-performance  implementation  of  the  Offline  stage  of  the 
RB  method.  This  enables  robust  treatment  of  problems  with  many  parameters  as  well  as  problems  with 
extremely  expensive  truth  solves,  which  are  out  of  reach  without  the  ability  to  exploit  parallel  computing 
resources.  We  explore  two  large-scale  model  problems  in  the  numerical  results  section  that  demonstrate  the 
effectiveness  of  our  implementation. 

1.1.  Overview  of  the  RB  Method 

There  are  many  discussions  of  the  RB  method  already  available  in  the  literature  (e.g.  [9,  11,  12,  13,  14]), 
hence  for  the  sake  of  brevity  we  do  not  reproduce  this  material  here.  However,  we  do  need  to  introduce  some 
basic  notation  that  will  be  required  in  the  subsequent  sections  of  this  paper.  Following  the  standard  RB 
method  notation  [9],  we  suppose  that  the  system  input  /x  €  V  C  Rp  (where  V  is  a  bounded  set)  enters  as 
a  parameter  in  a  PDE  which  describes  the  relevant  physical  phenomena  over  an  appropriate  spatial  domain 

C  R.d,d  =1,2  or  3.  We  may  consider  either  steady-state  or  time-dependent  problems;  in  the  time- 
dependent  case  we  suppose  0  <  t.  <  tf  is  the  time  interval  of  interest.  This  PDE  yields  («)  a  field  variable 
over  fl,  u(p)  £  X(£l)  (where  AT(ff)  is  an  appropriate  function  space),  and  (ii)  a  scalar  output  (or  outputs)  of 
interest,  s(/x)  €  R,  which  can  be  expressed  as  a  (say)  linear  functional  of  the  field  variable,  s(/x)  =  £{u{p)). 
Note  that  the  parameter  dependence  proceeds  from  the  PDE  through  the  field  variable  and  finally  to  the 
engineering  output. 

We  suppose  that  the  parametrized  PDE  is  discretized  by  finite  elements  in  space,  and  in  the  time- 
dependent  case,  by  a  finite  difference  scheme  in  time.  We  then  obtain  a  high-fidelity  finite  element  approxima¬ 
tion  by  computing  the  Galerkin  projection  over  a  finite  element  approximation  subspace  (c  X)  of  large 
dimension  Af.  For  any  /i  G  V,  our  truth  approximation  is  given  by  £  X ^  and  sJ^(p)  =  £(w^ (/x))  £  R. 

We  note  that,  given  our  restriction  to  p  £  T>,  all  solutions  of  interest  necessarily  reside  on  the  parametrically- 
induced  manifold  A4^  =  {vAr  {p)  |  p  £  V}.  We  observe  that  this  manifold  is  relatively  low-dimensional  and 
in  many  cases  of  practical  interest  it  can  be  expected  to  be  smooth.  Assuming  an  RB  approximation 
space  (which  we  shall  call  Xn  (c  X^)  where  dim(Abv)  =  N)  has  been  constructed1  we  then  consider 
Galerkin  projection  over  Ajv  to  obtain,  for  any  /x  £  V,  the  RB  solution  Un(p)  and  consequently  the  output 
Sn(p)  =  £(un(h)).  As  part  of  the  RB  method,  we  also  compute  rigorous  a  posteriori  bounds,  Apj(p)  and 
A sN(p),  for  the  error  in  the  RB  field  approximation  and  the  RB  output  approximation,  respectively.  Thus 
for  any  p  £  V,  we  have  \\u^(p)  —  un{p)\\ x  <  A n{p)  and  \s^(p)  —  sn{p)\  <  AsN(p).  The  existence  of  these 
bounds  allows  us  to  claim  that  our  RB  approximation  is  certified. 

The  rest  of  this  paper  is  arranged  in  the  following  way:  in  Section  2  we  focus  on  new  innovations  re¬ 
quired  to  provide  a  flexible  and  extensible  implementation  of  the  RB  method  that  can  be  employed  on 
high-performance  computers.  We  then  exploit  this  implementation  in  Section  3  to  explore  two  computa¬ 
tionally  intensive  problems:  a  natural  convection  problem  in  an  enclosed  three-dimensional  cavity,  and  a 
thermal  conduction  problem  with  parametrized  thermal  conductivities.  The  former  demonstrates  robust 


1  Tn  practice  we  employ  the  Greedy  algorithm  [9,  12]  to  construct  X\- .  see  Algorithm  1  for  details.  Additionally,  in  the 
time-dependent  case,  the  reduced  basis  approximation  inherits  the  same  temporal  discretization  as  the  truth. 
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model  reduction  for  a  nontrivial  nonlinear  problem  while  the  latter  demonstrates  the  efficacy  of  our  imple¬ 
mentation  for  a  problem  with  “many”  parameters.  In  order  to  gain  insight  into  the  complicated  parametric 
dependencies  in  the  thermal  conduction  problem,  we  also  use  our  rapid  and  accurate  RB  approximation  to 
perform  a  statistical  study  of  parameter  selection  in  the  Greedy  algorithm  and  of  the  parametric  sensitivity 
of  the  output.  In  each  case  the  Offline  stage  was  performed  on  the  TeraGrid  supercomputer  Ranger  at  the 
Texas  Advanced  Computing  Center  (TACC).  Finally,  in  Section  4,  we  give  a  few  concluding  remarks  and 
comment  on  potential  avenues  of  future  research  created  by  the  existence  of  our  open,  high-performance  RB 
simulation  framework. 

2.  High-Performance  Implementation  of  the  RB  Framework 

Our  implementation  of  the  RB  framework  is  available  as  part  of  the  open  source  C-| — |-  parallel  finite 
element  library  libMesh  [15].  We  utilize  libMesh’s  interfaces  to  PETSc  [16]  and  SLEPc  [17]  for  sparse 
linear  algebra  and  eigenvalue  solver  functionality,  respectively,  and  we  employ  existing  libMesh  functionality 
for  all  the  TV-dependent  operations  required  by  the  RB  method  [9] : 

(i)  PDE  solves  at  “greedily”  selected  parameter  values, 

(ii)  Truth  eigensolves  required  by  the  Successive  Constraint  Method  (SCM)  [18], 

(iii)  Poisson  solves  for  computing  the  Riesz  representations  of  the  residual  terms  and  outputs, 

(iv)  Inner  products  over  Q  to  obtain  the  data  arrays  required  in  the  Online  stage  of  the  RB  algorithm. 

The  object-oriented  organization  of  the  RB  code  is  illustrated  in  Figure  1.  The  base  class  RBBase 
implements  functionality  common  across  all  problem  types:  it  stores  the  “corners”  in  Rp  of  the  parameter 
domain  T>,  the  training  set  Strain  c  D,  and  the  set  of  parameter-independent  operators  and  parameter- 
dependent  functions  from  the  affine  expansion  of  the  PDE.  RBBase  is  extended  (via  the  C++  inheritance 
mechanism)  in  two  directions  which  encapsulate,  in  a  broad  sense,  the  two  major  classes  of  systems  —  (i) 
steady,  time-dependent,  and  nonlinear  PDE  systems  and  (ii)  SCM  eigensystems  —  which  must  be  solved  in 
the  RB  method.2 

The  RBSystem  class  (left  side  of  the  inheritance  tree  in  Figure  1)  represents  the  RB  implementation  for  el¬ 
liptic  problems.  Through  its  descendants  it  also  provides  support  for  linear  parabolic  (Trans  ientRBSystem) 
and  quadratically-nonlinear  parabolic  (QNTransientRBSystem)  PDEs.  These  classes  inherit  from  libMesh’s 
LinearlmplicitSystem  class,  thereby  gaining  access  to  matrix  assembly  and  linear  system  solver  inter¬ 
faces  for  performing  truth  finite  element  solves.  Each  class  provides  an  appropriate  implementation  of  the 
Greedy  algorithm;  for  example,  RBSystem  provides  the  standard  Greedy  algorithm  for  steady  state  prob¬ 
lems  [12]  (see  Section  2.2  for  details),  whereas  TransientRBSystem  implements  the  POD(t)-Greedy(/r) 
version  [19].  The  second  direction  in  which  RBBase  is  extended  is  for  the  SCM:  RBSCMSystem  inherits  from 
libMesh’s  EigenSystem  class  in  order  to  access  the  library’s  SLEPc  eigensolver  interface,  and  is  respon¬ 
sible  for  implementing  the  standard  SCM  algorithm.  QNTransientSCMSystem  extends  RBSCMSystem  to 
implement  the  modified  SCM  required  for  quadratically  nonlinear  time-dependent  problems  [20].  We  also 
provide  RBEIMSystem,  which  is  an  implementation  of  the  Empirical  Interpolation  Method  that  enables 
efficient  RB  treatment  of  non-affine  PDEs  [21]. 

2.1.  Parallelization  Opportunity  1:  N -Dependent  Operations 

We  exploit  two  key  opportunities  for  parallelization  in  the  Offline  stage  of  the  certified  RB  framework. 
First,  we  employ  libMesh  to  parallelize  the  TV-dependent  operations  of  the  RB  method  enumerated  in 
points  (i)-(iv)  in  the  previous  section.  In  addition,  libMesh  also  handles  the  more  “mundane”  yet  critical 
components  required  in  the  truth  solve,  such  as  the  input  and  parallel  partitioning  of  the  computational 
mesh,  and  the  output  of  truth  solutions  in  formats  recognized  by  popular  visualization  software.  libMesh 


“We  note  the  following  technical  detail:  RBBase  is  a  template  class,  hence  the  two  specific  instantiations  of  RBBase  shown 
in  Figure  1,  RBBase<LinearImplicitSystem>  and  RBBase<EigenSystem>,  represent  independent  inheritance  branches  and 
not  e.g.  a  “multiple  inheritance”  object  design. 
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Figure  1:  Class  diagram  for  rbOOmit.  Solid  lines  with  open  arrowheads  represent  C++  inheritance,  with 
arrows  pointing  from  child  to  parent  class.  Dashed  lines  represent  specific  instantiations  of  template  classes. 
Class  names  in  italics  are  part  of  the  LibMesh  [15]  finite  element  library. 


enables  us  to  treat  large-scale  truth  discretizations  because  it  employs  domain  decomposition  with  message 
passing  and  thread-based  parallelism  to  accelerate  matrix  assembly  and  distribute  memory  consumption 
among  multiple  processors. 

In  a  practical  RB  problem  we  must  define  an  “affine  decomposition”  a(v,  w ;  //)  =  Y^=i  w)  °f  a 

parametrized  PDE  operator  a(-,  •;  •)  :  x  X1^  x  V  — >  R,  where  the  aq{-,  •)  :  X1^  x  X^  — >  R.  are  parameter- 

independent  operators  and  the  0q{-)  :  V  — >  ffi.  are  parameter-dependent  functions.  To  illustrate  the  way  in 
which  libMesh  facilitates  parallelization  of  the  RB  method,  we  provide  example  assembly  code  for  a  simple 
parameter- independent  operator  —  the  Laplace  operator  on  a  subdomain  of  O  in  Listing  1.  The  assembly 
function  is  passed  a  FEMContext  reference  which  encapsulates  all  of  the  necessary  element-specific  data, 
and  a  reference  to  the  RBSystem  object  currently  being  assembled.  The  design  of  the  FEMContext  object 
allows  the  library  to  easily  implement  thread-based  parallelism  in  the  finite  element  assembly  procedures: 
element-specific  data  structures  can  be  duplicated  as  threads  are  spawned,  without  any  danger  of  accidentally 
introducing  shared  variable  lock  conditions.  An  additional  level  of  domain-decomposition  parallelism  is  also 
obtained  as  the  assemble_laplacian  ( )  function  is  called  on  processor  p  only  for  the  local  subset  of  finite 
elements  “owned”  by  processor  p. 

Examples  of  the  finite  element  data  contained  in  the  FEMContext  object  are  seen  in  lines  10,  14,  18,  and 
21  of  Listing  1,  where  references  to  element-specific  quadrature  rule  data,  shape  function  gradient  values, 
and  degree  of  freedom  counts  are  obtained  from  the  context  object,  which  is  called  “c”  in  the  code.  These 
data  allow  the  assembly  loop  which  follows  in  lines  23-26  to  be  element- generic,  that  is,  to  work  unchanged 
on  any  type  of  geometric  finite  element.  In  the  body  of  the  loop,  on  line  26  of  Listing  1,  the  element 
matrix  contribution  due  to  the  Laplace  operator  is  accumulated  in  c .  elem_  jacobian  (this  contribution 
is  subsequently  scaled,  in  a  separate  part  of  the  code,  by  the  parameter-dependent  function  value).  The 
Laplacian  example  considered  here  is  evidently  very  simple,  but  the  general  structure  in  Listing  1  can  be 
easily  extended  to  handle  much  more  complicated  operators. 

As  mentioned  previously,  libMesh  also  leverages  the  parallel  preconditioners  and  iterative  solvers  from 
PETSc  and  SLEPc  to  solve  large-scale  finite  element  systems  efficiently.  While  Krylov  subspace  iterative 
solvers  are  sufficiently  general  and  capable  of  handling  the  systems  which  must  be  solved  in  the  RB  method, 
in  some  cases  it  can  be  advantageous  to  employ  a  direct  solver.  For  example,  a  large  number  of  Poisson 
problems  with  fixed  matrix  and  multiple  right-hand  side  vectors  need  to  be  solved  during  the  Offline  stage 
to  compute  terms  required  for  constructing  the  RB  error  bound  [9] .  This  is  a  prototypical  scenario  in  which 
direct  solvers  can  outperform  their  iterative  cousins.  We  employ  the  parallel  sparse  direct  solver  MUMPS  [22] 
for  this  purpose,  and  this  predictably  leads  to  significant  speedup.  For  sufficiently  large  problems,  however, 
direct  solvers  may  still  become  infeasible  due  to  their  large  memory  footprint. 
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void  assemble_laplacian (FEMContext  &c.  Systems  system) 


if  (c . elem->subdomain_id ( )  ==  0) 

{ 

//  Get  index  of  "u"  variable  from  System 

RBSystems  rb_system  =  libmesh_cast_ref<RBSystem&> (system) ; 
const  unsigned  int  u_var  =  rb_system . u_var ; 

//  Reference  to  vector  of  quadrature  rule  data 
const  std: : vector<Real>  SJxW  = 

c . element_f e_var [u_var ] ->get_JxW ( ) ; 

//  The  velocity  shape  function  gradients  at  q.  pts 
const  std : : vector<std : : vector<RealGradient>  >  Sdphi  = 
c . element_f e_var [u_var ] ->get_dphi ( ) ; 

//  The  number  of  local  degrees  of  freedom  in  each  variable 
const  unsigned  int  n_u_dofs  =  c . dof_indices_var [u_var ] . size ( ) ; 

//  Now  we  will  build  the  affine  operator 

unsigned  int  n_qpoints  =  c . element_qrule->n_points ( ) ; 


} 

} 


for  (unsigned  int  qp=0;  qp  !=  n_qpoints;  qp++) 
for  (unsigned  int  i=0;  i  !=  n_u_dofs;  i++) 
for  (unsigned  int  j=0;  j  !=  n_u_dofs;  j++) 

c . elem_jacobian ( i, j )  +=  JxW[qp]  *  dphi [ j ] [qp] *dphi [ i ] [ qp] ; 


Listing  1:  Sample  libMesh  [15]  matrix  assembly  code  demonstrating  several  techniques  which  enable  par¬ 
allelization  in  the  Offline  stage  of  the  RB  framework. 


2.2.  Parallelization  Opportunity  2:  The  Greedy  Algorithm 

The  other  opportunity  for  parallelization  is  in  the  Greedy  algorithm  (see  Algorithm  1),  which  is  employed 
to  select  the  basis  function  “snapshots”  of  ATV\  which  we  denote  by  {C»,  •  •  • ,  Civ}-  The  Greedy  algorithm 
entails  evaluation  of  the  RB  approximation  and  associated  error  bounds  on  the  entire  set  of  training  points, 
Stram.  There  are  two  key  computational  tasks  in  the  Greedy  algorithm  that,  for  brevity,  we  encapsulate 
here  in  subroutines  Construction  and  Evaluation.  The  Construction  stage  takes  the  current  set 
of  basis  functions,  {([i, . . . ,  (n},  as  input  and  develops  the  Online  Dataset(lV)  needed  to  evaluate  the  RB 
approximation  and  associated  error  bounds: 

[Online  Dataset(A^)]  =  Construction({C„}ra=i  .  ^y)-  (1) 

The  Evaluation  subroutine  takes  a  discrete  parameter  set  S  C  V  as  input  and  returns  p*  =  argmax^gs  An(p), 
and  err  =  An(p*): 

[p*,err]  =  Evaluation(S;  Online  Dataset (./V)),  (2) 

Given  that  the  Evaluation  computations  are  very  inexpensive,  we  are  usually  able  to  utilize  relatively 
large  training  sets  and  obtain  good  “coverage”  of  V.  Nevertheless,  there  are  two  situations  in  which  the 
error  bound  sampling  on  Strain  tends  to  dominate  the  computational  cost  of  the  Offline  stage.  The  first  is 
problems  with  many  parameters,  in  which  —  due  to  the  “curse  of  dimensionality”  —  we  need  to  choose  very 
large  training  sets  in  order  to  obtain  a  reasonable  coverage  of  the  parameter  domain.  The  other  situation  is 
quadratically  nonlinear  problems  in  which  error  bound  evaluation  requires  0(N 4)  operations  [23,  24].  In  this 
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scenario,  when  N  1$>  1,  the  RB  solves  required  by  the  “arg  max”  operation  in  (2)  may  require  a  significant 
proportion  of  the  overall  runtime  of  the  Offline  stage. 


Algorithm  1  Greedy  Algorithm. 

1:  specify  Strain  C  V  of  size  nt rain  and  tolerance  e. 

2:  select  g*  G  V  (arbitrary),  set  N  =  0  and  X0  =  {0}. 

3:  while  err  >  e  do 

4:  Ov+i  =  (I  ~  HxN)uM (n*)  (normalized); 

5:  XN  -s—  XN  ®  span(£/v+1); 

6:  At-TV  +  1; 

7:  [Online  Dataset(A)]  =  Construction({<(i}j=ij...jjv)  (update)', 

8:  |^*,err]  =  Evaluation  (Strain,  Online  Dataset(A)); 

9:  end  while 

10:  set  TVmax  i  N . 


Fortunately,  we  can  employ  an  “embarrassingly  parallel”  implementation  of  the  “arg  max”  operation. 
We  initialize  by  partitioning  the  training  set  Strain  into  disjoint  subsets  {Pi,i  =  1, . . .  ,  nproc},  where  nproc 
denotes  the  number  of  processors.  The  parallel  algorithm  then  proceeds  with  a  local  “arg  max”  operation 
performed  simultaneously  on  all  processors.  Finally,  an  inexpensive  parallel  “arg  max”  is  performed  for  the 
nproc  local  “winning”  candidate  parameters  to  find  the  global  maximizer  from  Strain.  This  simple  strategy 
reduces  the  computation  time  for  the  ntrain  RB  solves  by  a  factor  of  nproc,  and  (as  we  demonstrate  in 
Section  3)  can  lead  to  very  significant  Offline  speedup. 

3.  Numerical  Results 

We  now  consider  two  computationally  expensive  example  problems  that  illustrate  the  RB  techniques 
discussed  above:  a  nonlinear  unsteady  Boussinesq  problem  and  a  “many  parameter”  thermal  conduction 
problem.  We  perform  a  detailed  numerical  study  of  the  behavior  of  the  Greedy  algorithm  for  the  thermal 
conduction  problem,  and  we  also  exploit  our  reduced  basis  approximation  to  perform  a  statistical  study  of 
sensitivity  derivatives.  All  Offline  computations  were  performed  on  the  Ranger  supercomputer  (located  at 
the  Texas  Advanced  Computing  Center),  which  has  in  total  62,976  cores,  123TB  of  memory,  and  a  theoretical 
peak  performance  of  579  TFLOPs.  We  generate  reduced  order  models  with  rigorous  error  bounds  that  can 
be  evaluated  on  a  standard  desktop  or  laptop  computer  in  real-time  or  near  real-time. 

3.1.  Unsteady  Boussinesq  Equations 

We  first  consider  a  three-dimensional  version  of  the  unsteady  natural  convection  problem  from  [24] .  The 
nondimensional,  coupled  conservation  of  momentum,  energy,  and  mass  equations  for  this  problem  are: 

^  +  — - - (V  ■  VV  +  V- W)  +VGrPrVP 

dt  2v/GrPr V  ' 

-  V2V  +  VGrPrTfl  =  0  (3) 

■  vr  +  v-rF)  -  — v2r  =  o  (4) 

dt  27&PrV  'Pr 

V  •  V  =  0  (5) 

where  V  =  (V) ,  lA,  ks)  is  the  nondimensional  velocity  vector,  P  is  the  nondimensional  pressure,  T  is  the 
nondimensional  temperature,  and 

g  =  (—  sin  </>,  0,  —  cos  </>) 
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is  a  unit  vector  in  the  direction  of  gravity.  Gr  is  the  Grashof  number,  a  measure  of  the  strength  of  buoyant 
effects  relative  to  thermal  diffusion,  and  Pr  =  0.71  is  the  Prandtl  number  (we  assume  the  fluid  is  air). 
Additional  details  of  the  nondimensionalization  can  be  found  in  [24]. 

The  nondimensional  domain  is  =  (0, 5)3\V  where  V  is  the  pillar  (2.4,  2.6)  x  (1.5, 3.5)  x  (0, 1).  We  solve 
for  the  field  variables  V,  P,  and  T  throughout  Cl.  The  “roof”  of  the  cavity  is  maintained  at  temperature 
T  =  0,  the  sides  and  base  of  the  cavity  are  perfectly  thermally  insulated,  and  the  top  and  sides  of  the  pillar 
are  subject  to  a  uniform  heat  flux  of  magnitude  Gr.  We  impose  no-slip  velocity  conditions  on  all  boundaries. 

The  truth  weak  formulation  for  (3)— (5)  requires  only  minor  modifications  from  the  two-dimensional  case 
in  [24],  hence  we  omit  it  here  for  the  sake  of  brevity.3  We  consider  the  time  interval  t  €  [0, 0.16],  and  employ  a 
Crank-Nicolson  temporal  discretization  with  K  =  100  timesteps.  The  three-dimensional  finite  element  mesh 
is  composed  of  quadratic  tetrahedral  elements  and  has  71,279  nodes  and  50,574  elements;  cutaway  views 
showing  the  internal  details  of  the  tetrahedralization  are  given  in  Figure  2.  The  total  number  of  degrees  of 
freedom  for  the  velocity,  temperature  and  pressure  is  A f  =  294,502.  The  RB  and  SCM  implementations  in 
this  case  were  based  on  the  classes  QNTransientRBSystem  and  QNTransientSCMSystem,  respectively. 


Figure  2:  x  (2a)  and  y  (2b)  interior  cutaway  views  of  the  mesh  used  in  the  three-dimensional  unsteady 
Boussinesq  calculations.  The  output  regions,  which  sit  just  above  the  heated  “pillar,”  are  meshed  exactly. 
The  pillar  itself  is  a  rectangular  cavity  at  the  base  of  the  domain  and  is  not  meshed.  The  mesh  is  composed 
of  quadratic  tetrahedral  elements  and  has  71,279  nodes  and  50,574  elements. 

We  consider  a  two-tuple  parameter  /i  =  (^1,^2)  =  (Gr,  <j>)  G  V  =  [4000,6000]  x  [0,0.2].  Our  goal  is 
to  study  the  parametric  dependence  of  the  temperature  in  regions  near  the  top  of  the  heated  pillar  in  the 
presence  of  natural  convection.  Therefore,  we  consider  the  following  L2(f2)-bounded  functionals  of  T 

s„{t\  n)  =  y),n)  =  — [777/  T{t\n)\  (6) 

Mi|A»|  Jd„ 

as  outputs,  where  D1  =  [2.2,  2.4]  x  [1.5, 3.5]  x  [1, 1.1],  D2  =  [2.4,  2.6]  x  [1.5,  3.5]  x  [1, 1.1],  and  D3  =  [2.6, 2.8]  x 
[1.5,  3.5]  x  [1, 1.1]  are  three  small  rectangular  prisms  above  the  pillar.  We  denote  the  RB  approximations  to 
the  outputs  from  (6)  as  Sjvjn(<;  p),  n  =  1,2,3.  A  cross-section  of  the  domain  geometry  and  output  regions 
is  given  in  Figure  3,  cross-sections  of  the  truth  solution  at  t  =  0.16  for  (Gr,  0)  =  (6000,0.2)  are  shown  in 
Figure  4. 


’'Note  that  we  employ  skew-symmetric  convection  operators  in  (3)  and  (4)  to  recover  discrete  stability  properties  and  simplify 
RB  error  bound  derivation. 


7 


Figure  3:  Cross-section  at  y  =  2.5  of  the  computational  domain;  note  that  fl  does  not  include  the  pillar 
which  is  shaded  in  red.  Cross-sections  of  the  output  regions  Di,  D2  and  D3  are  also  indicated. 

We  construct  a  (primal-only)  RB  space  via  the  POD(f)-Greedy(/r)  algorithm.4  We  specify  e  =  9  x  10~4 
for  the  target  tolerance  and  SN  =  3  for  the  number  of  POD  modes  to  include  at  each  greedy  iteration;  we 
further  specify  a  32  x  32  tensor  product  train  sample  Strain  of  uniformly  spaced  points  in  V.  The  tolerance 
is  satisfied  for  7Vmax  =  90  and  the  Greedy  convergence  is  shown  in  Figure  5.  Note  that  in  the  quadratically 
nonlinear  case  we  report  a  nominal  error  bound  during  the  POD(f)-Greedy(/x)  since  the  SCM  cannot  be 
performed  until  after  the  RB  space  has  been  generated  [20].  Nevertheless,  this  does  not  compromise  the 
rigor  of  the  RB  error  bounds  in  the  Online  stage.  All  Offline  calculations  are  performed  on  128  processors  on 
Ranger.  The  total  wall  time  required  for  the  entire  Offline  stage,  including  the  SCM,  is  significant:  about  42 
hours.  The  “embarrassingly  parallel”  solves  over  Strain  during  the  greedy  algorithm  require  approximately 
24  total  minutes  in  the  calculation  —  a  small  percentage  of  the  overall  Offline  runtime.  It  is  important  to 
note,  however,  that  had  these  solves  not  been  parallelized  in  the  manner  discussed  in  this  paper,  the  total 
wall-clock  time  for  this  stage  of  the  calculation  would  have  been  approximately  128  x  24  minutes,  or  just 
over  51  hours. 

In  Figure  6,  we  show  RB  output  approximations  and  corresponding  RB  error  bounds  obtained  with 
N  =  90  to  the  sn(t;  n)  for  parameter  values  (Gr,  </>)  =  (4000, 0),  (5000,  0.1),  and  (6000,  0.2).  We  note  that,  as 
expected,  the  “left”  and  “right”  outputs  (si  and  S3,  respectively)  coincide  when  <f>  =  0  and  their  separation 
increases  as  (j>  increases.  The  Online  computation  time  with  N  =  90  is  20.6  seconds  on  an  AMD  Opteron 
2382  processor.  Most  of  this  time  is  due  to  the  0{N 4)  complexity  of  the  error  bounds  in  the  quadratically 
nonlinear  case.  Nevertheless,  this  still  represents  a  speedup  factor  of  approximately  63  with  respect  to  a 
single  truth  solve  on  Ranger,  which  requires  21.7  minutes  on  128  processors. 

3.2.  Transient  Thermal  Conduction 

We  next  consider  a  “many  parameter”  problem,  which  is  challenging  from  the  point  of  view  of  model 
reduction  due  to  the  “curse  of  dimensionality” .  We  consider  transient  thermal  conduction  in  a  three- 
dimensional  “Swiss  cheese”  configuration.  This  is  a  generalization  of  the  steady-state  “thermal  block” 
problem  posed  on  the  unit  square  in  two  spatial  dimensions  in  [25];  here  we  consider  a  time- dependent 
formulation,  a  more  complicated  domain  Q  and  a  much  more  expensive  truth  discretization.  The  domain  D 
is  given  by  the  unit  cube  [0,  l]3  \  S,  where  S'  is  a  lattice  of  27  spheres  of  radius  1/5  and  centers  {0, 1/2, 1}  x 
{0, 1/2, 1}  x  {0, 1/2, 1}.  We  subdivide  D  into  27  subdomains,  D  =  the  O,  are  shown  in  Figure  7a, 

and  are  numbered  sequentially  (in  a  standard  “structured”  grid  pattern)  starting  with  Dq  at  the  origin  and 


4  A  primal-dual  formulation  is  not  appealing  in  this  case  since  (i)  the  beneficial  effect  on  output  error  bounds  would  be 
limited  due  to  exponential  error  bound  growth,  and  ( ii )  we  would  need  a  dual  system  for  each  output. 


Figure  4:  x  (4a)  and  y  (4b)  interior  cutplane  views  of  the  truth  solution  for  (Gr ,<p)  =  (6000,0.2).  Color 
contours  show  the  temperature  field  and  the  overlaid  velocity  vectors  indicate  the  circulatory  nature  of  the 
flow. 


then  proceeding  with  x-axis  aligned  rows,  advancing  in  the  y-direction  until  one  layer  is  complete,  followed 
by  the  same  pattern  for  the  middle  and  top  layers.  In  particular,  the  bottom  “corners”  of  the  domain 
correspond  to  fl0,  fl2)  and  while  the  central  subdomain  is  fl13. 

The  governing  equations  are  most  easily  described  in  weak  form:  find  u(t]  /i)  £  X  for  0  <  t  <  tf  =  1  such 


that 


~  26 
OU 


Xu  •  Xv  +  /  Xu • Xv 


j= i 


Vu  £  X, 


(7) 


where  is  the  thermal  conductivity  of  region  flj  relative  to  the  thermal  conductivity  of  region  Qq,  X  =  {u  £ 
i/1(S2)  v|r,  =  0},  Fb  is  the  bottom  boundary  “z  =  0”  (on  which  we  impose  inhomogeneous  flux  boundary 
conditions),  and  Ft  is  the  top  boundary  “z  =  1”  (on  which  we  impose  zero  Dirichlet  conditions).  We 
impose  homogeneous  natural  (zero)  flux  conditions  on  all  other  boundaries.  We  choose  P  =  26  parameters, 
jij .  1  <  j  <  26,  and  specify  the  parameter  domain  V  =  [0.5,  2]26.  We  consider  one  output, 


s(t;  n)  =  i{u(t ;  ft))  =  t^-t  /  u(t;  /x), 
l^o|  Jn0 

the  average  temperature  over  STo ,  where  flo  is  the  intersection  of  il  with  the  cube  [0, 1  /4] 3 . 

We  pursue  a  primal-dual  approach  for  this  linear  time-invariant  (LTI)  parabolic  problem,  where  the  dual 
problem  runs  “backwards”  in  time  [11].  The  weak  formulation  for  the  dual  problem  is:  find  z(t;  /i)  £  X  for 
0  <  t  <  tf  such  that 


-  /  v 


dz 

dt 


26 

j= 1 


Vu  •  Vz  +  /  Vu-Vz  =  0,  Vu  £  A', 


(8) 


with  fQvz(tf )  =  £(v).  We  select  a  backward  Euler  temporal  discretization  with  K  =  100  time  levels  for 
both  the  primal  and  dual  problems  and  employ  a  truth  finite  element  approximation  space  with  J\f  =  241,520 
degrees  of  freedom.  Both  the  primal  and  dual  systems  are  implemented  via  the  TransientRBSystem  class 
discussed  in  Section  2,  and  the  SCM  is  not  required  here  due  to  the  simple  “parametrically  coercive”  [9] 
structure  of  (7)  and  (8).  In  Figure  8  we  show  the  primal  and  dual  truth  solutions  at  t  =  1  and  t  =  0, 
respectively,  for  the  parameter  value 


y  =  (2, 2,  2,  2,  0.5, 2, 0.5,  0.5,  2,  2,  2, 2, 2,  2,  2,  2, 0.5,  2,  2,  2, 2,  2,  2,  2, 2, 2). 


(9) 
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Figure  5:  Greedy  convergence  for  the  3D  Boussinesq  pillar. 


This  parameter  specifies  the  minimum  thermal  conductivity  in  Qg  and  the  adjacent  subdomains  Clg,  D7,  and 
0-i7,  therefore  we  observe  a  “hot  corner”  in  Figure  8a. 

We  employ  the  POD(f)-Greedy(/z)  for  both  the  primal  and  dual  problems  and  develop  an  RB  space  with 
N  =  100  in  each  case.  To  investigate  the  behavior  of  the  Greedy  in  this  “many  parameter”  problem,  we 
perform  the  Offline  stage  three  times  with  log-randomly  generated  training  sets  of  size  ntrain  =  102,  104,  and 
106.  The  Offline  burden  due  to  the  expensive  truth  and  (in  the  ntr ain  =  106  case)  large  ?rtrain  is  considerable, 
and  both  forms  of  parallelization  from  Sections  2.1  and  2.2  are  crucial.  Here  we  used  512  processors  of 
Ranger  for  the  Offline  computations,  and  the  total  Offline  wall-clock  time  in  the  ntrain  =  106  case  was 
approximately  31  hours.  The  time  spent  evaluating  RB  error  bounds  for  the  ntrain  =  106  calculation  was 
roughly  an  hour  (per  processor).  The  embarrassingly  parallel  “arg  max”  operation  discussed  in  Section  2.2 
is  therefore  clearly  essential,  and  in  this  case  saved  over  500  hours  of  computation  time  compared  to  an 
equivalent  serial  algorithm. 

In  Figure  9  we  show  the  (normalized)  POD(t)-Greedy(/r)  convergence, 

N  ||njv(tA";  H*)\\l2(£1)  ’ 

where  A j^(/i*)  denotes  the  L2(H)-norm  RB  error  bound  at  time  level  K,  for  the  primal  RB  space  (the  dual 
convergence  plot  is  analogous  and  hence  is  omitted)  with  the  three  choices  of  ntrain-  We  observe  that  the 
POD(f)-Greedy(/z)  convergence  reported  in  the  ntrain  =  102  case  is  overly  optimistic  due  to  the  coarseness 
of  the  training  set  -  here  Strain  is  a  poor  surrogate  for  V.  However,  the  ?itrain  =  104  and  106  cases  agree 
quite  well,  and  this  suggests  that  these  larger  training  sets  provide  a  reasonable  surrogate  for  V.  Of  course, 
even  the  ?itrain  =  106  training  set  is  extremely  sparse  in  the  26-dimensional  parameter  domain  (e.g.  a  tensor 
product  grid  with  only  two  points  in  each  parameter  direction  would  have  226  «  6.7  x  10'  points)  but  the 
POD(f)-Greedy(/x)  nevertheless  performs  well  due  to  the  highly  smooth  “parametrically  coercive”  [9]  nature 
of  this  problem. 

We  now  proceed  to  the  Online  stage.  In  order  to  test  the  accuracy  of  the  primal-dual  RB  approximation, 
we  perform  RB  solves  on  a  5000  point  log-random  test  set,  Stest,  for  the  RB  spaces  generated  by  the 
Strain  =  102  and  ntrain  =  106  training  sets.  We  compute  the  maximum  relative  output  error  bound  over 
£test,  A^raex1(Stest),  defined  as 


A: 


s.rel /"testx  _ 


=  max 


&N  (i  ,M) 


SN(tK-,ri-A'N(tK-,U.y 


(10) 


10 


N 

A5,rel| 

max ’ 

Strain  =  10 

f^test^ 

Strain  =  10^ 

Online  time  (seconds) 

25 

1.13  x  10"1 

1.16  x  10”1 

3.48  x  10"2 

50 

2.73  x  10"2 

2.44  x  10"2 

1.01  x  10"1 

75 

1.03  x  10"2 

1.05  x  10~2 

2.10  x  10"1 

100 

5.22  x  10"3 

4.34  x  10-3 

2.74  x  10"1 

Table  1:  Relative  output  error  bounds  over  the  test  set  5test  containing  5000  log-randomly  selected  points 
and  corresponding  Online  evaluation  timings. 


where  A sN(tK\ fi)  (resp.  S]si{tK ;  fi))  denotes  the  primal-dual  output  bound  (resp.  dual-corrected  RB  output)  at 
the  final  time,  tK(=  tf).  Note  that  the  denominator  in  (10)  is  a  lower-bound  for  the  truth  output  s^(tK;  fi). 
In  Table  1  we  present  results  for  N  =  25,  50,  75,  and  100  along  with  corresponding  Online  computation 
times  for  a  single  primal-dual  RB  solve.5  The  data  clearly  indicate  the  trade-off  between  RB  accuracy  and 
Online  cost.  We  note  from  Table  1  that  for  the  ritrain  =  106  primal-dual  RB  spaces  with  N  =  100  we  obtain 
an  Online  output  approximation  with  relative  error  bound  less  than  0.5%  in  0.274  seconds  —  compared  to 
approximately  200  seconds  for  a  truth  solve  on  512  processors  of  Ranger. 


Statistical  analysis  of  the  Greedy  algorithm 

The  data  generated  during  the  primal  and  dual  basis  training  procedures  for  this  problem  provide  a  rich 
dataset  and  therefore  it  is  instructive  to  examine  the  distribution  of  the  parameter  values  selected  by  the 
Greedy  algorithm  in  the  high-dimensional  space  V.  We  first  consider  the  primal  POD(t)-Greedy(/x)  calcu¬ 
lation  with  retrain  =  106  (recall  that  Strain  is  log-uniform  randomly  distributed)  and  we  generate  individual 
histograms  for  each  of  the  26  parameters,  fij,  1  <  j  <  26,  based  on  the  100  greedily-selected  ji  values.  We 
specify  ?Xbins  equally-sized  bins  within  the  range  [/xmi„,  /xmax]  =  [0.5,  2]  and  let  fij(i),  1  <  i  <  ribins  denote  the 
number  of  occurrences  of  fij  within  bin  i.  To  simplify  the  presentation  we  then  generate  a  histogram  for  the 
“composite”  variable  (fi), 

1  26 

(MX*)  =  1  <  *  <  ™bins-  (11) 

1=1 

This  composite  histogram  for  ribins  =  10  is  given  in  Figure  10,  where  it  is  compared  to  a  histogram  for  a 
log-uniform  random  sample  of  data  points  over  the  same  parameter  range.  The  greedily-selected  parameter 
values  are  biased  towards  the  endpoints  of  the  fi  domain  relative  to  the  log-uniform  random  sample.  The 
tendency  of  the  Greedy  algorithm  to  select  parameter  values  near  the  endpoints  appears  analogous  to  the 
endpoint  clustering  seen  in  classical  interpolation  theory,  for  example  the  optimal  approximation  properties 
provided  by  Chebyshev  points  in  one  dimension  [27]. 

In  addition  to  the  statistical  distribution  of  the  fi  values,  it  is  also  instructive  to  look  at  the  different 
spatial  configurations  of  the  parameters  selected  by  the  greedy  algorithm,  in  particular  the  jumps  in  thermal 
diffusivity  between  adjacent  subregions.  We  define  the  average  jump  for  a  given  fi  vector  as 

{Sfi)  =  -^-\  l/b-Mx'l  (12) 

(ij)ei 

where  X  is  the  set  of  all  ordered  pairs  ( i,j )  for  which  i  <  j  and  regions  fl*  and  fij  share  a  common  surface. 
For  the  3x3x3  configuration  considered  in  the  “Swiss  cheese”  problem  (see  Figure  7a)  there  are  54  unique 
ordered  pairs  in  X.  A  histogram  of  (<5/x)  values  computed  for  the  ntrain  =  106  case  is  given  in  Figure  11, 


Computation  times  are  averaged  over  Stest  and  are  from  an  AMD  Opteron  2382  processor. 
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where  it  is  compared  to  the  jumps  computed  for  a  large  log-uniform  random  sample  of  data.  The  jumps  in 
the  greedily-selected  dataset  are,  on  average,  larger  than  those  from  the  log-uniform  random  sample.  This 
suggests  that  the  greedy  algorithm  detects  higher  error  in  y  configurations  with  larger  diffusivity  jumps  — 
which  in  some  sense  are  “higher  frequency”  modes  —  thereby  preferentially  selecting  them  for  inclusion  in 
the  reduced  basis. 


Sensitivity  derivative  analysis 

We  have  demonstrated  that  our  primal-dual  RB  method  yields  a  very  fast  and  accurate  output  approxima¬ 
tion  for  the  “Swiss  Cheese”  problem.  Our  goal  in  this  final  subsection  is  to  demonstrate  that  the  primal-dual 
RB  approximation  can  be  “recycled”  in  order  to  provide  useful  sensitivity  derivative  information  at  a  fraction 
of  the  computational  cost  of  the  full-order  model.6  Sensitivity  derivatives  are  crucial  in  many  computational 
engineering  contexts,  including  control,  optimization,  parameter  estimation  and  uncertainty  quantification 
[28,  29,  30].  In  the  present  study  we  exploit  the  RB  method  to  perform  a  statistical  analysis  of  the  sensi¬ 
tivity  derivatives  from  a  large  parameter  sample  to  gain  a  “global”  view  of  the  relative  importance  of  the 
parameters. 

We  first  sketch  the  derivation  of  the  sensitivity  formulation  for  this  problem.  Working  with  the  continuous¬ 
time  weak  formulation  for  the  sake  of  clarity,  we  note  that 


d  du 


26 


du 


du 


md^v  +  ^'l>  L  v^'Vw+  L  v7hZ  Vv  =  ~  i  Vw’Vu’  Vve*’ 


dy. 


p  j= i 

for  p  =  1, . . . ,  26.  Then,  from  (8)  and  (13)  we  obtain 

pt  /  o  r\  26 

X7u  •  Vz  )  dt  =  / 


d  du 


in¬ 


to  \  J  n 


dt  d/jj 


z+H 


Hj 


3=1 


d  ftp 


_  du 

V—  •  Vz- 

d  fip 


(13) 


•  va1L 

to  d^P 


•  Vz  dt  = 


ds 

dyp 


(f;  n). 


(14) 

This  relationship  carries  over  directly  to  our  fully-discrete  truth  finite  element  formulation,  so  that  we  obtain: 


=  Xj At  Jn  VuA/”(ifc;M)  •  VzM(tK  t+k-,n)^  .  (15) 


In  the  context  of  RB  sensitivity  derivatives,  however,  this  relationship  is  only  approximate  since  we  employ  a 
Galerkin  method  and  the  primal  and  dual  RB  spaces  do  not  coincide.  Also,  we  employ  Lagrange  RB  spaces 
here  which  are  targeted  at  u':V"  and  ,  not  and  ,  hence  even  if  the  RB  residuals  are  small  it  is 
not  clear  that  the  primal-dual  RB  approximation  will  lead  to  good  sensitivity  approximations.7  Hence,  our 
first  task  is  to  determine  whether  the  RB  approximation  developed  for  this  problem,  which  is  very  effective 
for  output  approximation,  is  also  satisfactory  for  sensitivity  derivative  approximation.  We  define  our  RB 
approximation  to  ^-(t^;  p.)  for  p  =  1, . . . ,  P  as, 


JpN(te;  p)  =  At  (~  [  VuN(tk;  im)  ■  S7zN{tK~t+k ;  y) 
fc= i  V  ^ np  / 


(16) 


and  set  JN{te;y)  =  y), . . . ,  Jff{te ;y))T  G 


For  convenience,  we  also  denote  JA"(t^;  /i)  = 


6 We  note  that  direct  and  finite-difference  approximations  for  sensitivity  approximations  both  require  0(P)  forward  solves. 
Since  we  have  P  =  26  parameters  here,  dual-based  sensitivity  calculations  are  certainly  preferable. 

7To  mitigate  this  effect  we  could  enrich  our  RB  space  with  derivative  information  to  obtain  a  Hermite  RB  space  [31],  but 
we  do  not  consider  this  here. 
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We  examine  the  accuracy  of  the  RB  sensitivity  derivative  approximation  using  the  ntrain  =  106  primal  and 
dual  RB  spaces,  with  N  =  100.  We  generate  a  second  test  set,  S!fst,  of  30  randomly  selected  parameters  in 
V  on  which  to  compare  J ^  and  Jn\  the  sample  set  is  necessarily  relatively  small  due  to  the  computational 
expense  of  the  truth  calculations.  We  note  that  the  RB  sensitivity  approximations  are  obtained  at  very 
modest  computational  cost:  each  truth  sensitivity  calculation  requires  663  seconds  on  512  processors  of 
Ranger  (which  includes  a  primal  and  dual  truth  solve  and  evaluation  of  (15)  for  t  =  1, . . . ,  K),  whereas  the 
corresponding  RB  calculation  requires  only  2.3  seconds  on  a  single  processor. 

We  consider  two  error  metrics  in  our  comparison:  the  mean  relative  error  in  the  RB  sensitivity  derivatives 

c  _  f  II  Hi)  ~  JN(tk; 

and  the  mean  error  in  the  normalized  RB  sensitivity  derivatives: 

_  x  K  _____  _____ 

I -test  I  Y.  IIJAA(^;  V i )  ~  JN(tk; 

piS  E‘est  k= 1 

where  =  JM(tk; n)/\\  /z)||2,  JN(tk\n)  =  J N  {tk  ]jj))  / \\  JN(tk;  /x)||2  and  ||  •  ||2  denotes  the 

discrete  i 2  norm.  Our  test  calculations  yielded  £j  =  0.33  and  £j  =  0.015.  This  indicates  that  our  RB 
sensitivity  approximation  does  not  capture  the  truth  sensitivities  accurately,  but  it  does  reproduce  the 
normalized  sensitivities  very  well  -  to  within  2%.  We  would  have  to  enrich  our  RB  spaces  further  to  reduce 
the  error  in  £j ,  but  in  the  present  context  we  are  in  fact  primarily  interested  in  the  normalized  sensitivity 
derivatives  since  the  components  of  J-v"  indicate  the  relative  sensitivity  of  the  output  to  the  parameters. 
Hence,  we  exploit  the  fact  that  we  have  a  cheap  and  accurate  RB  approximation,  Jn  ~  •/-v" ,  to  perform  a 
statistical  analysis  of  the  normalized  sensitivities  in  order  to  obtain  a  “global”  view  of  the  relative  importance 
of  the  parameters.  We  compute  sensitivity  derivatives  for  a  sample  of  10,000  parameters  in  T>  and  we  show 
the  mean,  (J(v(ffc)),  and  standard  deviation,  a(J]y(tk)),  at  t  =  0.05,  t  =  0.25  and  t  =  1  in  Figures  12  and  13. 

All  components  of  the  mean  sensitivity  derivatives  in  Figure  12  are  negative,  which  is  consistent  with 
the  fact  that  an  increase  in  thermal  conductivity  leads  to  a  decrease  in  output  magnitude.  We  see  from 
Figure  12a  that  at  early  times  only  7  of  the  26  parameters  have  a  significant  effect  on  the  output,  and  that 
the  output  is  most  sensitive  to  parameters  /ii,  /z 3,  /Z4  and  /zg,  which  determine  the  conductivities  in  the 
regions  adjacent  to  Ho-  Figures  12b  and  12c,  however,  show  that  at  later  times,  once  heat  has  diffused 
through  the  whole  domain,  the  output  sensitivity  to  parameters  in  subdomains  further  from  Ho  becomes 
more  significant  and  the  sensitivities  are  overall  much  closer  in  magnitude.  As  steady  state  is  approached, 
we  see  from  Figure  12c  that  the  output  is  most  sensitive  to  /Z13;  this  is  because  subdomain  13  is  the  central 
cube  in  H  and  has  the  largest  volume  (see  Figure  7a).  Finally,  we  note  that  over  our  sample  set  the  mean 
and  standard  deviation  of  J7N  and  are  small  over  the  whole  time  interval.  This  suggests  that 

and  Ms  would  be  the  primary  candidates  to  be  eliminated  from  the  problem  if  we  sought  to 
reduce  model  complexity.  These  parameters  correspond  to  the  five  subdomains  furthest  away  from  H0  in 
the  bottom  “layer”  of  H,  so  it  is  not  surprising  that  they  have  relatively  little  impact  on  the  output. 

4.  Conclusions 

The  certified  Reduced  Basis  method  provides  a  computational  framework  for  the  development  of  accurate 
reduced  order  approximations  with  rigorous  error  bounds  for  parametrized  PDEs.  However,  the  framework 
is  quite  elaborate  and  contains  a  number  of  subcomponents  in  both  the  Offline  and  Online  stages.  We 
have  developed  a  high-performance  object-oriented  implementation  of  the  certified  Reduced  Basis  method 
that  is  flexible,  extensible,  efficient  and  user-friendly.  There  are  many  other  extensions  to  the  RB  method 
that  have  not  been  discussed  in  detail  here.  Two  notable  examples,  which  are  in  fact  provided  in  the  RB 
extension  to  libMesh,  are  the  Empirical  Interpolation  Method  [21]  for  efficient  treatment  of  non-affine 
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parametrized  PDEs,  and  the  hp- RB  method  [32,  33]  which  adaptively  subdivides  T>  in  order  to  develop  a 
family  of  “localized”  RB  spaces  that  lead  to  acceleration  of  the  Online  stage. 

We  demonstrated  our  RB  implementation  on  two  computationally  expensive  problems  —  a  transient 
Boussinesq  problem  and  a  transient  thermal  conduction  problem.  In  each  case  we  demonstrated  that  our 
RB  approximation  provides  a  very  large  speedup  and,  equally  importantly,  a  vast  reduction  in  hardware 
requirements.  We  performed  a  detailed  study  of  the  thermal  conduction  problem  because  it  is  an  archetypal 
example  of  a  “many  parameter”  model  reduction  problem.  We  examined  the  parameter  selection  behavior 
of  the  Greedy  algorithm,  and  we  also  gained  interesting  insights  into  the  non-trivial  “global”  parametric 
sensitivities  by  leveraging  the  rapid  response  of  the  primal-dual  RB  sensitivity  approximation  to  perform  a 
statistical  analysis  on  a  large  sample  set.  These  numerical  examples  demonstrate  that  our  high-performance 
implementation  of  the  RB  method  enables  efficient  analysis  of  large-scale  parametrized  PDEs  in  real-time 
and  many-query  contexts,  and  therefore  presents  new  opportunities  for  robust  reduced  order  modeling  in  a 
range  of  different  application  areas. 

A  key  goal  of  our  future  work  is  to  employ  Online  RB  approximations  in  deployed  contexts  in  which 
we  interface  PDE-based  models  with  experimental  data  in  real-time.  This  could  enable  us  to  apply  high- 
fidelity,  three-dimensional  PDE  models  in  areas  such  as  non- destructive  testing,  real-time  control,  or  in  situ 
parameter  estimation,  which  are  out  of  reach  with  classical  PDE  discretizations. 
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(d)  (5000,0.1),  Zoom  near  tf 
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Figure  6:  The  RB  outputs  sjv.i (tk\n)  (blue  ▲),  SN,2{tk',n)  (red  01 
error  bounds  (solid  lines)  as  functions  of  time  for  three  values  of  /. i . 


sN,3(tk;n)  (black  x), 


and  associated 
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(a)  (b) 


Figure  7:  Subdomains  (7a)  for  the  “Swiss  cheese”  problem;  each  subdomain  is  associated  with  a  distinct 
parametrized  thermal  conductivity.  Interior  cutaway  view  (7b)  of  the  “Swiss  cheese”  mesh  which  is  composed 
of  linear  tetrahedra  and  has  32,307  vertices  and  168,926  elements.  A  once-uniformly-refined  version  of  this 
grid  which  was  used  for  the  “truth”  calculations  has  241,520  vertices  and  1,351,408  elements.  All  grids  used 
in  this  work  were  created  with  the  Gmsh  [26]  mesh  generator. 


(a)  (b) 


Figure  8:  Primal  (8a)  and  dual  (8b)  truth  solutions  at  t  =  1  and  t  =  0,  respectively,  for  the  parameter  in 
(9). 
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Figure  9:  Greedy  convergence  for  the  primal  basis  for  three  different  choices  of  retrain- 
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(a)  Greedy-selected  (b)  Log-uniform  random 


Figure  10:  Histograms  of  averaged  greedy-selected  (10a)  and  log-uniform  randomly  selected  (10b)  parameter 
values  for  the  “Swiss  cheese”  problem  with  ?itrain  =  106.  The  log- uniform  histogram  was  created  from  a 
large  set  of  sample  data  points  and  scaled  to  match  the  averaged  greedy-selected  (/x)  values.  The  bias  of  the 
greedy-selected  values  towards  the  endpoints  0.5  and  2  of  the  parameter  domain  relative  to  the  log-uniform 
sample  is  evident. 
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Figure  11:  Histograms  of  average  jumps  (<$/z)  from  (12)  in  spatially- adjacent  greedy-selected  (11a)  and  log- 
uniform  randomly  selected  (lib)  parameter  values  for  the  “Swiss  cheese”  problem  with  ntrain  =  106.  The 
log-uniform  histogram  was  created  from  a  large  set  of  sample  data  points  and  scaled  to  match  the  averaged 
greedy-selected  (6/j)  values. 
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